Wrangle and Analyze Data

Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent."

WeRateDogs has over 4 million followers and has received international media coverage. WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

Step 1: Gathering Data

Three datasets will be used:

1- twitter_df : Loaded data from twitter_archive_enhanced.csv

2- images_df : Loaded data from image_predictions.tsv

3- tweet_json : Data gathered from the Twitter API and stored as JSON

Gathering Source 1: CSV data

Gathering Source 2: TSV data (image predictions), downloaded from a URL
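
A minimal sketch of this step, assuming the file body has already been downloaded (for example with requests.get(url).content, where the URL is the one supplied with the project). The helper name load_image_predictions and the sample row are hypothetical and only illustrate the tab-separated parsing.

```python
import io

import pandas as pd


def load_image_predictions(raw_bytes: bytes) -> pd.DataFrame:
    """Parse tab-separated bytes (e.g. an HTTP response body) into a DataFrame."""
    return pd.read_csv(io.BytesIO(raw_bytes), sep="\t")


# Stand-in for the downloaded file body; the values are made up for illustration.
sample = b"tweet_id\tjpg_url\tp1\tp1_conf\n1\thttp://example.com/a.jpg\tpug\t0.9\n"
images_df = load_image_predictions(sample)
print(images_df.shape)
```

Keeping the download and the parsing separate makes it possible to save the raw bytes to image_predictions.tsv before loading them.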

Gathering Source 3: JSON text data from the Twitter API

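import tweepy

# 'X' stands in for the real API credentials, which are not published here.
auth = tweepy.OAuthHandler('X', 'X')
auth.set_access_token('X', 'X')

# Return raw JSON and pause automatically when the rate limit is reached.
api = tweepy.API(auth, parser=tweepy.parsers.JSONParser(), wait_on_rate_limit=True)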
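
Once the API responses are written to a text file, they can be read back into a dataframe. The sketch below assumes one JSON object per line and keeps only the id, retweet_count and favorite_count fields; the function name read_tweet_json and the sample values are hypothetical.

```python
import io
import json

import pandas as pd


def read_tweet_json(handle) -> pd.DataFrame:
    """Build a DataFrame from a file-like object holding one JSON tweet per line."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["id", "retweet_count", "favorite_count"])


# In-memory stand-in for the saved text file; the numbers are made up.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 10}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
)
tweet_json = read_tweet_json(sample)
print(tweet_json)
```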

Step 2: Assessing Data

To meet specifications, the following issues must be assessed.

You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.

Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.

The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.

You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

Assessing Source 1 : twitter_df

twitter_df has 2356 entries and 17 columns. The total memory usage of the dataframe is 313.0+ KB.
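
A programmatic assessment can be sketched as follows. The assess helper and the toy dataframe are hypothetical; on the real data the same calls would be applied to twitter_df, alongside twitter_df.info() for the entry count, column count and memory usage.

```python
import pandas as pd


def assess(df: pd.DataFrame) -> dict:
    """Collect a few basic quality indicators for a dataframe."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data for illustration only.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "name": ["Charlie", None, "a", "a"],
})
report = assess(toy)
print(report)
```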

Assessing Source 2 : images_df

Assessing Source 3 : tweet_json

Cleaning

Copy all of the datasets to preserve the originals.

Retweeted tweets, identified by unique user_id : twitter_df_clean

Retweeted tweets : json_df_clean

Define :

Lack of quality and how to fix it:

twitter_df:

images_df:

json_df:

Lack of tidiness and how to fix it:

Quality 1: Replace "None" values with NaN in the "doggo", "floofer", "pupper" and "puppo" columns, and do the same for any fully blank entries in twitter_df_clean.

Tidiness 1: Collect all dog types into one column in twitter_df_clean.
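
The two steps above can be sketched on a toy dataframe. The combined column name dog_type is an assumption, and this sketch keeps only the first stage it finds when a tweet mentions more than one.

```python
import numpy as np
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]

# Toy data for illustration only.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

# Quality: turn the literal string "None" into a real missing value.
df[stages] = df[stages].where(df[stages] != "None")


def first_stage(row):
    """Return the first non-missing stage in the row, or NaN if there is none."""
    for stage in stages:
        if pd.notnull(row[stage]):
            return row[stage]
    return np.nan


# Tidiness: one column for the dog type instead of four.
df["dog_type"] = df[stages].apply(first_stage, axis=1)
df = df.drop(columns=stages)
print(df)
```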

Test:

Filling the empty spaces with NaN values.

Test:

Quality 2: Replace the non-name words in twitter_df_clean['name'].
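
The list of words to replace is not given here, so the sketch below relies on an assumption: entries that are entirely lowercase (such as "a" or "the") or equal to the string "None" are treated as non-names and replaced with NaN.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
names = pd.Series(["Charlie", "a", "None", "the", "Lucy"], name="name")

# Assumed rule: lowercase words and the string "None" are not real names.
is_invalid = names.str.islower() | (names == "None")
cleaned = names.mask(is_invalid, np.nan)
print(cleaned)
```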

Test:

Quality 3: Split the "timestamp" column so that twitter_df_clean keeps only a "date" column.
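
One possible way to do this is to parse the timestamp and keep only its date part. The timestamp format in the toy data is an assumption about how the archive stores it.

```python
import pandas as pd

# Toy data for illustration only; the string format is assumed.
df = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2015-11-15 22:32:08 +0000"],
})

# Parse as UTC datetimes, then keep the calendar date only.
df["date"] = pd.to_datetime(df["timestamp"], utc=True).dt.date
df = df.drop(columns="timestamp")
print(df)
```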

Test:

Quality 4: Correct the 'rating_denominator' column in twitter_df_clean so that its maximum is 10.
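
This step can be read in more than one way, so the sketch below shows two hedged options on toy data: capping the denominator at 10, or keeping only rows whose denominator is exactly 10. Capping leaves the numerator untouched, which changes the meaning of a rating such as 84/70, so the second option may be preferable.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "rating_numerator": [12, 84, 13],
    "rating_denominator": [10, 70, 10],
})

# Option 1: cap every denominator at 10 (numerators are left as they are).
capped = df.assign(rating_denominator=df["rating_denominator"].clip(upper=10))

# Option 2: keep only the rows that already use a denominator of 10.
only_tens = df[df["rating_denominator"] == 10]

print(capped)
print(only_tens)
```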

Test:

Quality 5: Delete the rows whose retweeted_status_user_id is not null from twitter_df_clean.
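
A minimal sketch of this filter on toy data: a row is treated as a retweet when retweeted_status_user_id holds a value, so only the rows where it is null are kept.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the user ID is made up.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_user_id": [np.nan, 12345.0, np.nan],
})

# Keep original tweets only (no retweet information present).
originals = df[df["retweeted_status_user_id"].isnull()]
print(originals)
```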

Test:

Quality 6: Delete the columns that will not be used from all datasets.

twitter_df_clean :

Test:

json_df_clean:

Test:

images_df_clean:

Test:

Quality 7: Rename nondescriptive column names in images_df_clean.
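
The renaming can be sketched with a mapping passed to rename. The original column names p1, p1_conf and p1_dog are assumed, and apart from first_confidence, which is used later in the analysis, the new names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; the original column names are assumed.
images_df_clean = pd.DataFrame({
    "tweet_id": [1],
    "p1": ["pug"],
    "p1_conf": [0.9],
    "p1_dog": [True],
})

# Map short names to descriptive ones.
new_names = {
    "p1": "first_prediction",
    "p1_conf": "first_confidence",
    "p1_dog": "first_dog",
}
images_df_clean = images_df_clean.rename(columns=new_names)
print(images_df_clean.columns.tolist())
```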

Test:

Quality 8: Rename "text" to "tweet" in twitter_df_clean.

Test:

Quality 9: Rename "id" to "tweet_id".

Test:

Quality 10: Delete duplicated URLs from images_df_clean.
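
A minimal sketch of the deduplication, assuming the image URL lives in a column named jpg_url and that the first occurrence of each URL should be kept.

```python
import pandas as pd

# Toy data for illustration only.
images = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
})

# Drop every later row that repeats an earlier URL.
deduped = images.drop_duplicates(subset="jpg_url", keep="first")
print(deduped)
```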

Test:

Quality 11: Correcting datatypes.
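
Which columns need new types is not listed here, so the sketch below only illustrates typical conversions under assumed column names: IDs as strings, dates as datetimes, and the dog type as a category.

```python
import pandas as pd

# Toy data for illustration only; the column names are assumed.
df = pd.DataFrame({
    "tweet_id": [892420643555336193],
    "date": ["2017-08-01"],
    "dog_type": ["doggo"],
})

df["tweet_id"] = df["tweet_id"].astype(str)           # IDs are labels, not numbers
df["date"] = pd.to_datetime(df["date"])               # enables date comparisons
df["dog_type"] = df["dog_type"].astype("category")    # small fixed set of values
print(df.dtypes)
```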

Quality 12: Check whether there is any data dated after 2017-08-01 that needs to be cleaned.

"You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used."

There is no data after 2017-08-01.
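
A check of this kind can be sketched as a comparison against a cutoff timestamp. The dates below are toy values, not the archive's.

```python
import pandas as pd

# Toy dates for illustration only.
dates = pd.to_datetime(pd.Series(["2015-11-15", "2017-08-01", "2016-06-30"]))
cutoff = pd.Timestamp("2017-08-01")

# Select anything strictly after the cutoff date.
late = dates[dates > cutoff]
print(len(late))
```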

Tidiness 2: Merge the 3 dataframes.
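
The merge can be sketched as two chained joins on tweet_id. The inner join is an assumption, chosen here so that only tweets present in all three sources (and therefore with images) remain; the toy columns are hypothetical.

```python
import pandas as pd

# Toy data for illustration only.
twitter = pd.DataFrame({"tweet_id": ["1", "2", "3"], "tweet": ["a", "b", "c"]})
images = pd.DataFrame({"tweet_id": ["1", "2"], "first_prediction": ["pug", "collie"]})
counts = pd.DataFrame({"tweet_id": ["1", "3"], "favorite_count": [10, 7]})

# Keep only the tweets that appear in every dataframe.
master = (
    twitter
    .merge(images, on="tweet_id", how="inner")
    .merge(counts, on="tweet_id", how="inner")
)
print(master)
```

The key column must have the same dtype in all three dataframes before merging, which is one reason to convert tweet_id to a string first.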

Test :

Checked :

Saving the master dataset as 'twitter_archive_master.csv'

Analyzing and Visualizing Data

You must produce at least three (3) insights and one (1) visualization. You must clearly document the piece of assessed and cleaned (if necessary) data used to make each analysis and visualization.

The descriptive_statistics results indicate:

Which dog is the outlier with regard to ratings?

Most common breed on WeRateDogs

The Shetland sheepdog is the most common breed.

Are predictions and images matching?

According to third_confidence, img has the highest and img2 the second-highest probability of showing a Shetland sheepdog, and both images do show one.

According to first_confidence, img3 has the highest probability of showing a Shetland sheepdog (approximately 60% on average). This is also correct.

Most popular dog types on WeRateDogs

66% of the dog types are pupper. Tweets are mostly about puppers, i.e. puppies.

The second most common type, doggo, accounts for only 20% of the total dog types.
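
Shares like these can be computed with value_counts(normalize=True), which ignores missing values by default. The sketch uses toy data and an assumed dog_type column, so its output does not reproduce the percentages above.

```python
import pandas as pd

# Toy data for illustration only; None marks tweets without a dog type.
dog_types = pd.Series(["pupper", "pupper", "doggo", "puppo", None], name="dog_type")

# Fraction of each dog type among the tweets that have one.
shares = dog_types.value_counts(normalize=True)
print(shares)
```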

The most popular dog name: Charlie